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Abstract 



We study the problem of estimating the Earth Mover's Distance (EMD) between prob- 
abihty distributions when given access only to samples. We give closeness testers and 
additive-error estimators over domains in [0, A]'', with sample complexities independent of 
domain size - permitting the testability even of continuous distributions over infinite do- 
mains. Instead, our algorithms depend on other parameters, such as the diameter of the 
domain space, which may be significantly smaller. We also prove lower bounds showing 
the dependencies on these parameters to be essentially optimal. Additionally, we consider 
whether natural classes of distributions exist for which there are algorithms with better 
dependence on the dimension, and show that for highly clusterable data, this is indeed the 
case. Lastly, we consider a variant of the EMD, defined over tree metrics instead of the 
usual £i metric, and give optimal algorithms. 

1 Introduction 

In traditional algorithmic settings, algorithms requiring linear time and/or space are generally 
considered to be highly efficient; sometimes even polynomial time and space requirements are 
acceptable. However, today this is often no longer the case. With data being generated at rates 
of terabytes a second, sublinear time algorithms have become crucial in many applications. In 
the increasingly important area of massive data algorithmics, a number of models have been 
proposed and studied to address this. One of these arises when the data can be naturahy viewed 
as a probability distribution (e.g., over IP addresses, or items sold by an onhne retailer, etc.) 
that allows i.i.d. samples to be drawn from it. This is the model on which this paper focuses. 

Perhaps the most fundamental problem in this model is that of testing whether two distribu- 
tions are close. For instance, if an online retailer such as Amazon.com wishes to detect changes 
in consumer habits, one way of doing so might be to see if the distribution of sales over all of- 
fered items this week, say, is significantly different from last week's distribution. This problem 
has been studied extensively, mostly under the £i and £2 distances, and algorithms sublinear in 
time and sample complexity exist to distinguish whether two distributions are identical or e-far 
from each other [ll[7l[T8]. However, under the £1 distance, for instance, the sample complexity, 
though sublinear, can be no smaller than n^/^ (where n is the domain size), which may be 
prohibitively large [H [18] . 

Fortunately, in many situations there is a natural metric on the underlying domain, under 
which nearby points should be treated as "less different" than faraway points. This motivates a 
metric known as Earth Mover's Distance (EMD), first introduced in the vision community as a 



measure of (dis) similarity between images that more accurately reflects human perception than 
more traditional ii It has since proven to be important in computer graphics and vision 

[l3l [l2l [HI m [m [m [l5] , and has natural applications to other areas of computer science. As a 
result, its computational aspects have recently drawn attention from the algorithms community 
as well [21 [9l El |8] . However, previous work has generally focused on the more classical model of 
approximating the EMD when given the input distributions explicitly; that is, when the exact 
probability of any domain element can be queried. As far as we know, no work has been done 
on estimating and closeness testing of EMD when given access only to i.i.d. samples of the 
distributions. 

In this model, it is easy to see that we cannot hope to compute a multiplicative approxima- 
tion, even with arbitrarily many samples, since that would require us to distinguish between 
arbitrarily close distributions and identical ones. However, if we settle for additive error, we 
show in this paper that, in contrast to the ii distance, we can estimate EMD using a num- 
ber of samples independent of the domain size. Instead, our sample complexities depend only 
on the diameter of the domain space, which in many cases can be significantly smaller. The 
consequence is that this allows us to effectively deal with distributions over extremely large do- 
mains, and even, under a natural generalization of EMD, continuous distributions over infinite 
domains. 

Specifically, if p and q are distributions over M C [0, A]'^ (where d is a constant), we can 

• estimate EMD{p,q) to within an additive error of e with 0((A/e)°'+'^(i)) samples, 

• distinguish whether p = q or EMD{p,q) > e with 0((A/e)^'^/'^+'^(-'^)) samples, and 

• if g is known, distinguish whether p = qov EMD{p,q) > ewithO((A/e)'^/2+0(i)) samples. 

We also give lower bounds that imply these results to be essentially optimal (up to small 
poly{A/e) factors). In the case of d = 1 or 2, upper and lower bounds for both testers become 
e((A/e)2). 

Additionally, we consider assumptions on the data that might make the problem easier, and 
give an improved algorithm in the case our input distributions is highly clusterable. Finally, it 
is natural to consider the EMD over domains endowed with a metric other than ii distance. We 
give an optimal (upto polylogarithmic factors) algorithm for estimating EMD over tree metrics. 

2 Preliminaries 

2.1 Earth Mover's Distance 

We start with the following definition. 

Definition 1 A supply-demand network is a directed bipartite graph G = {S L)T,E) consist- 
ing of supply vertices S and demand vertices T, with supply distribution p on S and demand 
distribution q on T , and edge set E = S xT, with associated weights w : E ^ R"*". A satisfying 
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flow for G is a mapping f : E M"*" such that for each s £ S and each t £ T, 



J2fiM) = p{s), and 



t'&T 



s'es 



The cost of satisfying flow f is given by 



C{f) = Y,f{e)w{e). 



eeE 



We define the Earth Mover's Distance (EMD) as follows. 

Definition 2 Let p and q be probability distributions on a finite metric space (M, 6) . Then 
let G be the supply-demand network given by supply vertices S = {sx \ x G M} and demand 

vertices T = {tx \ x G M}, with supply distribution p : Sx p{x) and demand distribution 
q : tx ^ q{x), and edge weights w : {sx,ty) i-^ d{x,y). Define EMD{p,q) to be the minimum 
cost of all satisfying flows for G. 

It is straightforward to verify that the EMD as defined above is a metric on all probability 
distributions on M. Note, moreover, that to upperbound the EMD it suffices to exhibit any 
satisfying flow. 

Since the magnitude of the EMD depends on the distances in the underlying metric space 
M, we must assume that M is of bounded diameter. In particular, we will in this paper focus 
mostly on the case where M C [0, A]'^, for some A > 0, endowed with the £i metric. 

Finally, we deflne a closeness tester for the EMD in the usual way as follows. 

Definition 3 Let p, q be two distributions on (finite) metric space M . An EMD-closeness 
tester is an algorithm which takes as input samples from p and q, together with a real number 
£ > 0, and guarantees that 

(1) if p = q, then it accepts with probability at least 2/3, and 

(2) if EMD{p,q) > e, then it rejects with probability at least 2/3. 

Note that it is also possible to define EMD for any two probability measures p and q in an 
infinite (continuous) metric space {A4,5m), M C [0, A]"^, by using Wasserstein metric: 



where r(p, q) denotes the collection of all measures on A4 x M with marginals p and q on 
the first and second factors respectively. By using e/4-net, {M,6m) can be discretized into a 
finite metric (M, 6) in such a way that the EMD distance of any two probability measures only 
increases or decreases at most e/2. Therefore, an E'MZ'-closeness tester with additive error e/2 
for (M, S) is also a valid £^M£)-closeness tester for {M, S) with additive error e. 
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2.2 Properties of EMD 

Let us get a firmer handle on the EMD by relating it to ii distance, via the following lemmas. 

Lemma A If p and q are distributions on {M,6), there exists a minimum cost satisfying flow 
f from S to T (as defined in Def. \^ such that the total amount sent by f across edges with 
non-zero cost is exactly \ \p — q\\i/2. 

Proof Let a = J2^mm{p(x),q{x)} and A = niax{p{x) , q{x)} , so that A — a= \\p — q\\i 
and A + a = 2. Observe that the total amount sent from 5 to T is 1, and the maximum possible 
amount sent across edges with zero cost is a, which leaves at least 1 — a = \\p — q\\i/2 to be 
sent across non-zero cost edges. 

On the other hand, suppose that for some x, an optimal flow through the edge {sx, t^) is less 
than min{p{x), q{x)}. Then there exist points y and z such that / sends at least a > from Sx 
to ty and from Sz to tx- We can replace this partial flow, which costs a{5{x, y) + 6{x, z)), with 
one sending a from Sx to tx and from Sz to ty, which costs a{6{y,z)). Doing so, we will not 
increase the total cost (by triangle inequality), nor will we affect the rest of the flow. We can 
therefore repeated the procedure until we obtain an optimal flow that saturates every zero-edge. 



Corollary 5 Ifp and q are distributions on M , where M has minimum distance 6 and diameter 
A, then 

\^P^.6<EMD{p,q)<\-^^^-A. 

Lemma 6 Let p and q be distributions on M with diameter A, and A4 = {Mi, . . . ,Mfc} be a 
partition of M wherein diam(Mj) < F for every i G [k]. Let P and Q be distributions on A4 
induced by p and q, resp. Then EMD{p, q) < H^^^ ■ A + T. 

Proof Let us define distribution p' by moving some of the probability weight of p between 
Mi's (taking from and depositing anywhere within the respective Mi's) in such a way that p' 
induces Q on A4. This is effectively a flow from P to Q where all distances are bounded by A, so 
by LemmaHit can be done at cost at most ^^^-3^ • A. It follows that EMD{p,p') < ^l^-^^ • A. 

Then, having equalized the probability weights of each Mi, let us move the probability weight 
of p' within each Mi to precisely match q. This might require moving everything (i.e., 1), but 
the distance anything is moved is at most T, so EMD{p' ,q) < T and the lemma follows by 
triangle inequality. ■ 



2.3 Some tools from previous work 

Here, for completeness, we state some results from previous work that we will make use of 
later. First, we define some useful distributions described in [9j for testing closeness between 
distributions over subsets of [0, A]''. 
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Definition 7 Given distribution p over M C [0, A]'^ and a positive integer i, let G^*^ be a grid 
with side length ~ over [0, A]*^ centered at the origin. Define the i-coarsening of p, denoted 
p^'^\ to be the distribution over the grid cells of G^*^ such that, for each grid cell c of G^^\ 

The p*-*) 's can be thought of as coarse approximations of p where ah points in the same grid 
ceh are considered to be the same point. We then have the following lemma from [9j relating 
the EMD of two distributions to a weighted sum of the ii distances of their i-coarsenings. 

Lemma 8 For any two distributions p and q over M C [0, A]'^, 

/log(2Ad/e) 
EMDip,q)<dl ^ — ^. 

Having established the various relationships between the EMD and the £i distance, we will 
make use of the result from [1] below for testing closeness of distributions in ii as a subroutine. 

Theorem 9 Given access to samples from two distributions p and q over M , where \M\ = 
n, there exists an algorithm that takes 0(n^/'^e~^lognlog(l/(5)) samples and (1) accepts with 
probability at least 1 — 5 if p = q and (2) rejects with probability at least 1 — 5 if \\p — q\\i > £■ 

Alternatively, via a simple Chernoff bound analysis similar to [3], we can show that a whole 
distribution can be approximated efficiently, giving us another closeness tester for li. 

Lemma 10 Given access to samples from a distribution p over M , where \M\ = n, and pa- 
rameters e,5,t > 0, there exists an algorithm that takes 0{t^^e^'^ lognlog{l/5)) samples and 
outputs a distribution p over M such that with probability at least 1 — 5 

(1 — e) max{p{i),t} < p{i) < (1 + e) max{p{i) , t} 

for every i & M. 

As a result, we can simply estimate p and q with p and q, respectively, and compute their 
ii distance to get the following alternative tester, which gives us a different trade-off between 
n and e that will become useful later. 

Theorem 11 Given access to samples from two distributions p and q over M, where \M\ = 
n, there exists an algorithm that takes 0(ne^^ log n log(l/5)) samples and (1) accepts with 
probability at least 1 — 5 if p = q and (2) rejects with probability at least 1—5 if \\p — q\\i > £■ 

3 Closeness tester 

In this section, we consider the EMD-closeness testing problem when the domain is M C [0, A]'^. 
The main idea behind the algorithm is to embed EMD into the £i metric and use an £i-closeness 



5 



tester (Theorems [9] and [TT]) to test the resulting distributions. Recah from the prehminaries, 
j?^*) and g^*) are the z-coarsening approximations of p and q. We have the following algorithm, 
where the subroutine ^i-Closeness-Tester(p, g, e, 5) is an ^i-closeness tester on distributions 
p and q with distance parameter e and failure probability 6. 

EMD-Closeness-Tester(p, q, e) 

1 for f = 1 to log(2A(i/e) do 

2 if ^i-Closeness-Tester(p«, g«, ^^ig^L/.) ' 3log(2Ad/£) ^ ^""j^"^*^ ^^^^^ 

3 reject 

4 accept 

Note that our subroutine takes advantage of whichever tester (Theorem or requires 
fewer samples. Specifically, when d is small, the domains of the z-coarsenings of p and q are 
small and e is the bottleneck, so we use Theorem II II On the other hand, if d is large, the sizes 
of these domains become the bottleneck and we use Theorem [9l This gives us the following 
theorem. 



Theorem 12 The above is an EMD-closeness tester for distributions over M C [0,A]'^ that 
takes 0{{2Ad/e)'^'^^^) samples when d > 6 and 0((A/e)^) samples when d < 2. 

Proof li p = q, then p^^^ = g*-*) for all i, so by the union bound, the probability that the 



algorithm rejects is at most log ■ ^^^^2Ad 



1 

3- 



If, on the other hand, EMD{p, q) > e, then by Lemma [8l 

^log(2AcZ/£) 



d 



E 

1=1 



A 




It follows by the pigeonhole principle that there exists an index i such that 



P 



> 



Adlog{2Ad/e)' 



Hence, for that index i, the £i-closeness tester in Step 2 will reject (with probability 2/3). 

Now let us analyze the number of samples the algorithm needs. In the i^^ iteration of the 
main loop, p^*-* and q^^^ has a domain with = 2^^ elements, and we need to run an £i-closeness 



tester with a distance parameter of Ei 



e2' 



-. Consider the following two cases: 



A log(2Ad/e) ■ 

d > 6: Using the algorithm of Theorem [H we get a sample complexity of 
0(nf e-) = O (^2(-/3-)^ |^4Adlog(2Ad/e) _ 

This quantity is maximized when i = log(2Ad/e), which gives us a total complexity of 
0((2Ad/e)2'^/3). 
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• d <2: Using the algorithm of Theorem II 11 we get a sample complexity of 

This quantity is maximized when i = 1, giving us a total complexity of 0((A/e)^). 



If one of the distributions is explicitly known, we can use the corresponding £i-closeness 
tester with sample complexity 0(n-'^/^e~^ logn) from [3j to similarly get the following theorem. 

Theorem 13 There exists an EMD-closeness tester for distributions over M C [0, A]^, where 
one is explicitly known, that takes 0((2A/e)°'/^) samples when d > 4. 

4 Additive-error estimation 

We have seen that in £i-closeness testing, sometimes it is to our advantage to simply estimate 
each probability value, rather than use the more sophisticated algorithm of [Ij. This seemingly 
naive approach has another advantage: it gives an actual numeric estimate of the distances, 
instead of just an accept/reject answer. Here, we use this approach to obtain an additive 
approximation of the EMD of two unknown distributions over M C [0, A]^ as follows. 

EMD-Approx(p, q, e) 

1 Let G be the grid on [0, A]*^ with side length and let P and Q be the distributions 
induced by p and q on G, with weights in each cell concentrated at the center 

2 Take 0{{4dA/e)'^~^'^) samples from P and Q, and let P and Q be the resulting 
empirical distributions 

3 return EMD{P,Q) 

Theorem 14 EMD-Approx takes 0((4dA/e)'^"'"^) samples from p and q and, with probability 
2/3, outputs an e- additive approximation of EMD{p,q). 

Proof Note that a sample from p or q gives us a sample from P or Q, respectively, so it 
remains to prove correctness. Observe that |G| = (AdA/e)'^, so G has 2^'^'^^/^) subsets. By the 
Chernoff bound, with the 0{{4dA/e)'^~^'^) samples from p, we can guarantee for each S C G 
that \P{S) - P{S)\ > 4^ with probability at most 2-('^'^^/^)V3- By the union bound, with 
probability at least 2/3, all subsets of G will be approximated to within an additive In 
that case, 

Y,\P{c)-P{c)\ 

ceG 

2max\P(S) - P(S)\ < -4ir- 



^'-^lli = 
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We then have, by Corollary [5l EMD(P, P) < e/4. Further, since each cell has radius e/4, 
we have EMD{p, P) < e/4, giving us by the triangle inequality, EMD(p, P) < e/2. Similarly, 
EAID{q,Q) < e/2, so again by triangle inequality, we get 

\EMD{p, q) - EMD{P, Q)\ < EMD{p, P) + EMD{q, Q) = e, 

completing our proof. ■ 



5 Lower bounds 

We can show that our tester is optimal for the 1-dimensional domain by a simple argument: 

Theorem 15 Let A he an EMD-closeness tester of distributions over any domain M C [0, A]. 
Then A requires r2((A/e)^) samples. 

Proof Consider two distributions p and q over the domain {0, A}, where p is the distribution 
that puts the weight of ^ at and A and q is the distribution that puts the weight ofl/2 + e/A 
at and the weight of 1/2 — e/A at A. Clearly EMD{jp,q) = e, and it is a classic result that 
distinguishing p from q requires r2((A/e)^) samples. ■ 

Clearly this also implies the same lower bound for 2-dimensional domains, making our al- 
gorithm optimal in those cases. Next we prove that our d-dimensional tester is also essentially 
optimal in its dependence on A/e. 

Theorem 16 There is no EMD-closeness tester that works on any M C [0,A]'^ that takes 
o((A/e)^'^/'^) samples. 

Proof Suppose A is an EMD-closeness tester that requires only o((A/e)^'^/^) samples. Then 
consider the following ^i-closeness tester for e = 1: 

^i-Tester(p, q, e = 1) 

1 Let G be a grid on [0, A]'^ with side length An"-'^/'^ 

2 Let / be an arbitrary injection from [n] into the lattice points of G 

3 Let P and Q be distributions on the lattice points of G induced by / on p and q, resp. 

4 return A{P, Q, ^An'^/'^) 

Correctness is easy to see: if p = q, then clearly P = Q as well and the tester accepts; 
alternatively, if — g^Hi = 1, then by Corollary [5] and the observation that — Q||i = — 

EMD{P,Q) > ''^"^''^ • An-i/'^ = ^An-^/^ 
so the tester rejects, as required. 
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Now, to take a sample from P (or Q), we simply take a sample x from p (or q) and return 
f{x). Hence, the sample complexity of this tester is 



A \ 



o(n2/3) 



(iAn-i/rf J 

But this contradicts the lower bound for £i-closeness testing from [HHH], completing our proof. 



If one of the distributions is known, we can similarly use the weaker 0(n^/^) lower bound 
for uniformity testing to obtain the following theorem. 

Theorem 17 There is no EMD-closeness tester that works on any M C [0, A]'^, where one of 
the input distributions is explicitly known, that takes ©((A/e)*^/-^) samples. 



6 Clusterable distributions 

Since the general technique of our algorithms is to forcibly divide the input distributions into 
several small "clusters," it is natural to consider what improvements are possible when the 
distributions are naturally clusterable. We obtain the following substantial improvement from 
the exponential dependence on the dimension d that we had in the general case. 

Theorem 18 If the combined support of distributions p and q can be partitioned into k clusters 
of diameter e/2, and we are given the k centers, then there exists an EMD-closeness tester for 
p and q that requires only 0{k'^^^{dA/e)'^) samples. 

Proof Let us denote the set of centers by C = {Ci, . . . , Cfc}. Consider the distributions P 
and Q on C induced by p and q, respectively, by assigning each point to its nearest center. If 
EMD{p,q) > e, byLemmaEl Jl^^^(dA) > e/2. We can, of course, obtain samples from P and 
Q by sampling from p and q, respectively, and returning the nearest center. Our problem thus 
reduces to ^i-testing for (^)-closeness over k points, which requires 0(A:2/3(dA/e)'^) samples 
using the ^i-tester from [1]. ■ 

If we do not assume knowledge of the cluster centers, then we are still able to obtain the 
following only slightly weaker result. 

Theorem 19 If the combined support of p and q can be partitioned into k clusters of diameter 
e/4, then even without knowledge of the centers there exists an EMD-closeness tester for p and 
q that requires only d{kdA/e + k'^/'^{dA/e f) < d{k{dA/e)'^) samples. 

To prove this, we need the following result by Alon et al. that was implicit in Algorithm 1 
from [ij. 
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Lemma 20 (Algorithm 1 from fl^) There exists an algorithm which, given distribution p, re- 
turns k' < k representative points if p is {k,b)-clusterable, or rejects with probability 2/3 if p is 
'J -far from {k,2b)-clusterable, and which requires only 0{k log k/^) samples from p . Moreover, 
if the k' points are returned, they are with probability 2/3 the centers of a [k, 2b) -clustering of 
all but a ^-weight of p. 

Proof (of Theorem \19i) By the lemma, if our distributions are (A:, e/4)-clusterable, using 
0{kdA/e) samples we obtain a (fc', e/2)-clustering of all but an ^^-fraction of the support of 
p and q, with centers C Note that the unclustered probability mass contributes at most e/4 to 
the EMD. The theorem then follows from an identical argument as that of Theorem 1181 (since 
we now know the centers). ■ 

Note that in general, if we assume {k, 6)-clusterability, this implies ((26/e)'^/i;, e/2)-clusterability 
(by packing ii balls), where knowledge of the super-cluster centers also implies knowledge of 
the sub-cluster centers. Similarly, in the unknown centers case, {k, 6)-clusterability implies 
((86/e)'^/c, e/4)-clusterability. Unfortunately, in both cases we reintroduce exponential depen- 
dence on d, so clusterability only really helps when the cluster diameters are as assumed above. 

7 EMD over tree- metrics 

So far we have considered only underlying £i-spaces (or, almost equivalently, ip). We will now 
see what can be done for EMD over tree-metrics, and prove the following result. 

Theorem 21 If p and q are distributions over the nodes of a tree T, with edge weight func- 
tion w{-), then there exists an e- additive- error estimator for EMD{p,q) that requires only 
0{{Wn/e)'^) samples, where W = maxeu;(e), n is the number of nodes of T , and EMD{p, q) 
is defined with respect to the tree metric of T . Moreover, up to polylog factors, this is optimal. 

Proof First, let us consider an unweighted tree T over n points (i.e., where every edge has 
unit weight), with distributions p and q on the vertices. Observe that the minimum cost flow 
between p and g on T is simply the flow that sends through each edge e just enough to balance 
p and q on each subtree on either side of e. In other words, if T^. is an arbitrary one of the two 
trees comprising T — e, 

EMD(p,q) = Y,\v{Te)-q{Te)\. 
e 

Then, with 0(n? /e^) samples we can, for every Te, estimate p{Te) and q{T(,) to within =1= 2(71-1) • 
This gives us an e-additive estimator for EMD{p,q). 

Generalizing to the case of a weighted tree, where edge e has weight w{e), we have 

EMD{p,q) = Y^HeMTe) - q{Te)\. 
e 

It then suffices to estimate each p{T(.) and q{Te) term to within ^ 2w{e){n-i) • Thus, 0{{Wn/ e)"^) 
samples suffice, where W = maxet(;(e). 



10 



Note that what we we get is not only a closeness tester but also an additive-error estimator. 
In fact, even if we only want a tester, this is the best we can do: in the case where T is a line 
graph (with diameter n — 1), the standard biased-coin lower bound implies we need ^^(n^/e^) 
samples. ■ 
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